Simple Statistics in Python

The actions in CAS cover a wide variety of statistical analyses. While we can't cover all of them here, we'll at least get you started on some of the simpler ones.

First we need to get a CAS connection set up.


In [1]:
import swat

conn = swat.CAS(host, port, username, password)

The simple Action Set

The basic statistics package in CAS is called simple and should be already loaded. If you are using IPython, you can see what actions are available using the ? operator.


In [2]:
conn.simple?

You can also use Python's help function.


In [3]:
help(conn.simple)


Help on Simple in module swat.cas.actions object:

class Simple(CASActionSet)
 |  Analytics
 |  
 |  Actions
 |  -------
 |  simple.correlation : Generates a matrix of Pearson product-moment correlation
 |                       coefficients
 |  simple.crosstab    : Performs one-way or two-way tabulations
 |  simple.distinct    : Computes the distinct number of values of the variables in
 |                       the variable list
 |  simple.freq        : Generates a frequency distribution for one or more
 |                       variables
 |  simple.groupby     : Builds BY groups in terms of the variable value
 |                       combinations given the variables in the variable list
 |  simple.mdsummary   : Calculates multidimensional summaries of numeric variables
 |  simple.numrows     : Shows the number of rows in a Cloud Analytic Services table
 |  simple.paracoord   : Generates a parallel coordinates plot of the variables in
 |                       the variable list
 |  simple.regression  : Performs a linear regression up to 3rd-order polynomials
 |  simple.summary     : Generates descriptive statistics of numeric variables such
 |                       as the sample mean, sample variance, sample size, sum of
 |                       squares, and so on
 |  simple.topk        : Returns the top-K and bottom-K distinct values of each
 |                       variable included in the variable list based on a user-
 |                       specified ranking order
 |  
 |  Method resolution order:
 |      Simple
 |      CASActionSet
 |      builtins.object
 |  
 |  Data and other attributes defined here:
 |  
 |  actions = {'correlation': <class 'swat.cas.actions.simple.Correlation'...
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from CASActionSet:
 |  
 |  __call__(self, *args, **kwargs)
 |  
 |  __dir__(self)
 |  
 |  __getattr__(self, name)
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from CASActionSet:
 |  
 |  from_reflection(asinfo, connection) from builtins.type
 |      Create a CASActionSet class from reflection information
 |      
 |      Parameters
 |      ----------
 |      asinfo : dict
 |          Reflection information from the server
 |      connection : CAS object
 |          The connection object to associate with the CASActionSet
 |      
 |      Returns
 |      -------
 |      CASActionSet class
 |  
 |  get_connection() from builtins.type
 |      Retrieve the registered connection
 |      
 |      Since the connection is only held using a weak reference,
 |      this method will raise a SWATError if the connection object
 |      no longer exists.
 |      
 |      Returns
 |      -------
 |      CAS object
 |          The registered connection object
 |      
 |      Raises
 |      ------
 |      SWATError
 |          If the connection object no longer exists
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from CASActionSet:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes inherited from CASActionSet:
 |  
 |  trait_names = None

Let's start off with the summary action. We'll need some data, so we'll load some CSV from a local file. Then we'll run the action on it.


In [4]:
cars = conn.read_csv('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
out = cars.summary()
out


Out[4]:
§ Summary
Descriptive Statistics for _T_8B115BFF_7FB304E630C0
Column Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
0 MSRP 10280.0 192465.0 428.0 0.0 32774.855140 14027638.0 19431.716674 939.267478 3.775916e+08 6.209854e+11 1.612316e+11 59.288490 34.894059 4.160412e-127
1 Invoice 9875.0 173560.0 428.0 0.0 30014.700935 12846292.0 17642.117750 852.763949 3.112443e+08 5.184789e+11 1.329013e+11 58.778256 35.196963 2.684398e-128
2 EngineSize 1.3 8.3 428.0 0.0 3.196729 1368.2 1.108595 0.053586 1.228982e+00 4.898540e+03 5.247754e+02 34.679034 59.656105 3.133745e-209
3 Cylinders 3.0 12.0 426.0 2.0 5.807512 2474.0 1.558443 0.075507 2.428743e+00 1.540000e+04 1.032216e+03 26.834946 76.913766 1.515569e-251
4 Horsepower 73.0 500.0 428.0 0.0 215.885514 92399.0 71.836032 3.472326 5.160415e+03 2.215110e+07 2.203497e+06 33.275059 62.173176 4.185344e-216
5 MPG_City 10.0 60.0 428.0 0.0 20.060748 8586.0 5.238218 0.253199 2.743892e+01 1.839580e+05 1.171642e+04 26.111777 79.229235 1.866284e-257
6 MPG_Highway 12.0 66.0 428.0 0.0 26.843458 11489.0 5.741201 0.277511 3.296139e+01 3.224790e+05 1.407451e+04 21.387709 96.729204 1.665621e-292
7 Weight 1850.0 7190.0 428.0 0.0 3577.953271 1531364.0 758.983215 36.686838 5.760555e+05 5.725125e+09 2.459757e+08 21.212776 97.526890 5.812547e-294
8 Wheelbase 89.0 144.0 428.0 0.0 108.154206 46290.0 8.311813 0.401767 6.908624e+01 5.035958e+06 2.949982e+04 7.685150 269.196577 0.000000e+00
9 Length 143.0 238.0 428.0 0.0 186.362150 79763.0 14.357991 0.694020 2.061519e+02 1.495283e+07 8.802687e+04 7.704349 268.525733 0.000000e+00

elapsed 0.0788s · user 0.093s · sys 0.093s · mem 8.8MB

The result object here is a CASResults object which is a subclass of a Python dictionary. In this case, we only have one key "Summary". The value for this key is a DataFrame. We can store the DataFrame in a variable so that it's easier to work with, then we can do any of the standard Pandas DataFrame operations on it. Here we are setting the first column as the index for the DataFrame so that we can do data selection easier later on.


In [5]:
df = out['Summary']
df.set_index(df.columns[0], inplace=True)
df


Out[5]:
Descriptive Statistics for _T_8B115BFF_7FB304E630C0
Min Max N NMiss Mean Sum Std StdErr Var USS CSS CV TValue ProbT
Column
MSRP 10280.0 192465.0 428.0 0.0 32774.855140 14027638.0 19431.716674 939.267478 3.775916e+08 6.209854e+11 1.612316e+11 59.288490 34.894059 4.160412e-127
Invoice 9875.0 173560.0 428.0 0.0 30014.700935 12846292.0 17642.117750 852.763949 3.112443e+08 5.184789e+11 1.329013e+11 58.778256 35.196963 2.684398e-128
EngineSize 1.3 8.3 428.0 0.0 3.196729 1368.2 1.108595 0.053586 1.228982e+00 4.898540e+03 5.247754e+02 34.679034 59.656105 3.133745e-209
Cylinders 3.0 12.0 426.0 2.0 5.807512 2474.0 1.558443 0.075507 2.428743e+00 1.540000e+04 1.032216e+03 26.834946 76.913766 1.515569e-251
Horsepower 73.0 500.0 428.0 0.0 215.885514 92399.0 71.836032 3.472326 5.160415e+03 2.215110e+07 2.203497e+06 33.275059 62.173176 4.185344e-216
MPG_City 10.0 60.0 428.0 0.0 20.060748 8586.0 5.238218 0.253199 2.743892e+01 1.839580e+05 1.171642e+04 26.111777 79.229235 1.866284e-257
MPG_Highway 12.0 66.0 428.0 0.0 26.843458 11489.0 5.741201 0.277511 3.296139e+01 3.224790e+05 1.407451e+04 21.387709 96.729204 1.665621e-292
Weight 1850.0 7190.0 428.0 0.0 3577.953271 1531364.0 758.983215 36.686838 5.760555e+05 5.725125e+09 2.459757e+08 21.212776 97.526890 5.812547e-294
Wheelbase 89.0 144.0 428.0 0.0 108.154206 46290.0 8.311813 0.401767 6.908624e+01 5.035958e+06 2.949982e+04 7.685150 269.196577 0.000000e+00
Length 143.0 238.0 428.0 0.0 186.362150 79763.0 14.357991 0.694020 2.061519e+02 1.495283e+07 8.802687e+04 7.704349 268.525733 0.000000e+00

Now that we have an index, we can use the loc property of the DataFrame to select rows based on index values as well as columns based on names.


In [6]:
df.loc[['MSRP', 'Invoice'], ['Min', 'Mean', 'Max']]


Out[6]:
Min Mean Max
Column
MSRP 10280.0 32774.855140 192465.0
Invoice 9875.0 30014.700935 173560.0

DataFrame methods on CASTable objects

In the previous example, we called the summary action directly. This gave us a CASResults object that contained a DataFrame with the result of the action. You can also use many of the Pandas DataFrame methods directly on the CASTable object so that, in many ways, they are interchangeable. One of the most common methods used on a Pandas DataFrame is the describe method. This includes statistics that would normally be gotten by running variations of the summary, distinct, topk, and percentile actions. This is all done for you and the output created is the same as what you would get from an actual Pandas DataFrame. The difference is that in the case of the CASTable version, you can handle much, much larger data sets.


In [7]:
cars.describe()


Out[7]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428.000000 428.000000 428.000000 426.000000 428.000000 428.000000 428.000000 428.000000 428.000000 428.000000
mean 32774.855140 30014.700935 3.196729 5.807512 215.885514 20.060748 26.843458 3577.953271 108.154206 186.362150
std 19431.716674 17642.117750 1.108595 1.558443 71.836032 5.238218 5.741201 758.983215 8.311813 14.357991
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
25% 20329.500000 18851.000000 2.350000 4.000000 165.000000 17.000000 24.000000 3103.000000 103.000000 178.000000
50% 27635.000000 25294.500000 3.000000 6.000000 210.000000 19.000000 26.000000 3474.500000 107.000000 187.000000
75% 39215.000000 35732.500000 3.900000 6.000000 255.000000 21.500000 29.000000 3978.500000 112.000000 194.000000
max 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000

Other examples of DataFrame methods that work on CASTable objects are min, max, std, etc. Each of these calls simple.summary in the background, so if you want to use more than one, you might be better off just calling the describe method once to get all of them.


In [8]:
cars.min()


Out[8]:
Make                Acura
Model          3.5 RL 4dr
Type               Hybrid
Origin               Asia
DriveTrain            All
MSRP                10280
Invoice              9875
EngineSize            1.3
Cylinders               3
Horsepower             73
MPG_City               10
MPG_Highway            12
Weight               1850
Wheelbase              89
Length                143
Name: min, dtype: object

In [9]:
cars.max()


Out[9]:
Make                             Volvo
Model          Z4 convertible 3.0i 2dr
Type                             Wagon
Origin                             USA
DriveTrain                        Rear
MSRP                            192465
Invoice                         173560
EngineSize                         8.3
Cylinders                           12
Horsepower                         500
MPG_City                            60
MPG_Highway                         66
Weight                            7190
Wheelbase                          144
Length                             238
Name: max, dtype: object

In [10]:
cars.std()


Out[10]:
MSRP           19431.716674
Invoice        17642.117750
EngineSize         1.108595
Cylinders          1.558443
Horsepower        71.836032
MPG_City           5.238218
MPG_Highway        5.741201
Weight           758.983215
Wheelbase          8.311813
Length            14.357991
dtype: float64

Conclusion

Although we have just barely scratched the surface, you should now be able to get some basic statistical results back about your data. Whether you want to use the action API directly, or the familiar Pandas DataFrame methods is up to you.


In [11]:
conn.close()

In [ ]: